My Github repository for my assignments can be found at this URL: https://github.com/chimandy/compscix-415-2-assignments.git

library(mdsr)
library(tidyverse)
library(ggplot2)

Section 3.2.4 :all exercises

Question 1: Run ggplot(data = mpg). What do you see?

ggplot(data = mpg)

If I run ggplot(data = mpg), I see an empty grapth

Question 2:How many rows are in mpg? How many columns?

mpg
## # A tibble: 234 x 11
##    manufacturer model displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4      1.8  1999     4 auto… f        18    29 p     comp…
##  2 audi         a4      1.8  1999     4 manu… f        21    29 p     comp…
##  3 audi         a4      2    2008     4 manu… f        20    31 p     comp…
##  4 audi         a4      2    2008     4 auto… f        21    30 p     comp…
##  5 audi         a4      2.8  1999     6 auto… f        16    26 p     comp…
##  6 audi         a4      2.8  1999     6 manu… f        18    26 p     comp…
##  7 audi         a4      3.1  2008     6 auto… f        18    27 p     comp…
##  8 audi         a4 q…   1.8  1999     4 manu… 4        18    26 p     comp…
##  9 audi         a4 q…   1.8  1999     4 auto… 4        16    25 p     comp…
## 10 audi         a4 q…   2    2008     4 manu… 4        20    28 p     comp…
## # … with 224 more rows

There are 234 rows and 11 columns in mpg

Question 3: What does the drv variable describe? Read the help for ?mpg to find out.

?mpg

By running ?mpg, I found out drv varible descibes as below: There are basically 3 types of drv varible: f = front-wheel drive r = rear wheel drive 4 = 4wd

Question 4: Make a scatterplot of hwy vs cyl

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = hwy, y = cyl))

Question 5: What happens if you make a scatterplot of class vs drv? Why is the plot not useful?

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = class, y = drv))

A scatterplot of class vs drv is not useful because class and drv are just two lists of models/types of cars with different drive transmission categories. Therefore, there is no studies or correlation between these two categorical varibales without comparing them against any countinuous variables, it will make the points on the plot overllapped with one another.

Section 3.3.1 all exercises

Question 1: What’s gone wrong with this code? Why are the points not blue?

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

The above code is wrong in order to change the apperance of the plot to blue because the color argurment is inside aes() instead of ouside aes() and within geom_point (). Therefore, the correct code should be as below:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

Question 2 :Which variables in mpg are categorical? Which variables are continuous? (Hint: type ?mpg to read the documentation for the dataset). How can you see this information when you run mpg?

In mpg dataset, a.Categorical variables: manufacturer, model, trans, drv, fl, class. b.Continuous variables:displ,year,cyl,cty,hwy.

mpg
## # A tibble: 234 x 11
##    manufacturer model displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4      1.8  1999     4 auto… f        18    29 p     comp…
##  2 audi         a4      1.8  1999     4 manu… f        21    29 p     comp…
##  3 audi         a4      2    2008     4 manu… f        20    31 p     comp…
##  4 audi         a4      2    2008     4 auto… f        21    30 p     comp…
##  5 audi         a4      2.8  1999     6 auto… f        16    26 p     comp…
##  6 audi         a4      2.8  1999     6 manu… f        18    26 p     comp…
##  7 audi         a4      3.1  2008     6 auto… f        18    27 p     comp…
##  8 audi         a4 q…   1.8  1999     4 manu… 4        18    26 p     comp…
##  9 audi         a4 q…   1.8  1999     4 auto… 4        16    25 p     comp…
## 10 audi         a4 q…   2    2008     4 manu… 4        20    28 p     comp…
## # … with 224 more rows

When you run mpg, categorical varibales are type , whereas cointinuous variables are either type or

Question 3: Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for categorical vs. continuous variables?

I. Continuous varibale:

a. color:

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = cty))

b. size:

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, size = cty))

c. shape:

#ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = cty))
## mapping: x = ~displ, y = ~hwy, shape = ~cty 
## geom_point: na.rm = FALSE
## stat_identity: na.rm = FALSE
## position_identity

Error: There is an error showing “A continuous variable can not be mapped to shape”

II. Categorical variable:

a.color:

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

b.size:

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

c.shape:

ggplot(data=mpg)+
  geom_point( mapping = aes(x=displ, y=hwy,size=class))
## Warning: Using size for a discrete variable is not advised.

Warning: there is a warning showing that “using size for a discrete variable is not advised”"

After mapping a continuous variable to color, size, and shape and mapping a catergorical variable to color, size, and shape, there are showing some major differences between these aestehtics. For example, continuous varibales like cty are visualized on a spectrum of color to identify different types of city miles per gallon (eg. a spectrum of blue),wheresas catergorical variables like class are binned into discrete categories of different colors for different type of cars.

Question 4. What happens if you map the same variable to multiple aesthetics?

ggplot(data=mpg)+
  geom_point( mapping = aes(x=displ, y=hwy, color=cty, size=cty))

If I map the same vaibale cty to multiple aesthetics like color or size, both aesthetics are implemented, and multiple legends are generated.

Question 5: What does the stroke aesthetic do ? What shapes does it work with ? ( hint: use ?geom_point)

?geom_point
ggplot(data=mpg)+
  geom_point(mapping=aes(x=displ, y=hwy), stroke=3, alpha=0.4, color='blue')

stroke adjusts the thicknees of the border for alpha that can take on different colors both inside and outside.

Question 6: What happens if you map an aesthetic to something other than a variable name, like aes(colour = displ < 5)? Note, you’ll also need to specify x and y.

ggplot(data = mpg)+
  geom_point(mapping = aes(x=displ, y=hwy, color=displ<5))

If you map if you map an aesthetic to something other than a variable name, like aes(colour = displ < 5, R executes the code and creates a temporary variable containing the results of the operation. Here, the new variable takes on a value of TRUE if the engine displacement is less than 5 or FALSE if the engine displacement is more than or equal to 5.

Section 3.5.1:#4 and #5 only

Question 4: Take the first faceted plot in this section:

What are the advantages to using faceting instead of the colour aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset?

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2)

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

By using faceting instead of the colour aesthetic, there are advantages and disadvanatges as below:

a. Advantages :

Faceting splits the date into seperate grids for each differnt class and we can better visualizing and better micro-studying trend within each individual facet.

b. Disadvantages :

It’s harder to visualize the whole picture of the overall macro-relationship across all facets.

The color aesthetic is fine when your dataset is small, but with the enalarging and bigger datasets all points may begin to overlap with one another and the audience might get confused with insuficient presentation with a colored plot only when the datasets grows and just add additional color aesthetic to the plot.

Question 5:Read ?facet_wrap. What does nrow do? What does ncol do? What other options control the layout of the individual panels? Why doesn’t facet_grid() have nrow and ncol arguments?

?facet_wrap

nrow: It will define number of rows that facet plot should have.

ncol: It will define number of columns that facet plot should have

scales: Should scales be fixed (“fixed”, the default), free (“free”), or free in one dimension (“free_x”, “free_y”)?

shrink: If TRUE, will shrink scales to fit output of statistics, not raw data. If FALSE, will be range of raw data before statistical summary. labeller: A function that takes one data frame of labels and returns a list or data frame of character vectors. Each input column corresponds to one factor. Thus there will be more than one with formulae of the type ~cyl + am. Each output column gets displayed as one separate line in the strip label. This function should inherit from the “labeller” S3 class for compatibility with

labeller(). See label_value() for more details and pointers to other options.

as.table: If TRUE, the default, the facets are laid out like a table with highest values at the bottom-right. If FALSE, the facets are laid out like a plot with the highest value at the top-right.

switch: By default, the labels are displayed on the top and right of the plot. If “x”, the top labels will be displayed to the bottom. If “y”, the right-hand side labels will be displayed to the left. Can also be set to “both”.

drop: If TRUE, the default, all factor levels not used in the data will automatically be dropped. If FALSE, all factor levels will be shown, regardless of whether or not they appear in the data.

dir: Direction: either “h” for horizontal, the default, or “v”, for vertical.

strip.position: By default, the labels are displayed on the top of the plot. Using strip.position it is possible to place the labels on either of the four sides by setting strip.position = c(“top”, “bottom”, “left”, “right”)

The reason that facet_grid() doesn’t have nrow and ncol as facet_wrap because facet_grid() forms a matrix of panels defined by row and column faceting variables. It is most useful when you have two discrete variables, and all combinations of the variables exist in the data, whereas facet_wrap wraps a 1d sequence of panels into 2d. This is generally a better use of screen space than facet_grid() because most displays are roughly rectangular.

Section 3.6.1:#1-5

Question 1: What geom would you use to draw a line chart? A boxplot? A histogram? An area chart?

a.Line chart - geom_line() b.boxplot - geo_boxplot() c.histogram - geom_histogram() d.are chart -geom_area()

Question 2:Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) + 
  geom_point() + 
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Question 3: What does show.legend = FALSE do? What happens if you remove it?

Why do you think I used it earlier in the chapter?

Without show.legend = FALSE

ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, color = drv))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

With show.legend = FALSE

ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, color = drv), show.legend = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

show.legend = FALSE removes the legend although The aesthetics are still mapped and plotted.When show.legend = FALSE is removed, the legend will shows a legend that explains which colors correspond to which values.It was used earlier in this chapter because it makes the comparasion presentation with the other 2 plots more clean and consistent.

Question 4: What does the se argument to geom_smooth() do?

It reconfirms the conditional arguement and determines whether or not to draw a confidence interval around the smoothing line.

Question 5: Will these two graphs look different? Why/why not?

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point() + 
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot() + 
  geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

No because they use the same data and mappings to produce the same plot. The ony difference is that the first one can avoid duplication by passing a set of mappings to ggplot() function to automatically reapply the global mappings to each geom in the graph.

Extra credit: #6

Question 6: Recreate the R code necessary to generate the following graphs.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(size=4) + 
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(size=4)+
  geom_smooth(aes(group = drv), se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(data=mpg,mapping=aes(x=displ, y=hwy, color=drv))+
  geom_point(size=4) + 
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(data=mpg,mapping=aes(x=displ, y=hwy))+
  geom_point(mapping= aes (color=drv), size=4) + 
  geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(data=mpg, mapping=aes(x=displ, y=hwy))+
  geom_point(aes(color=drv),size=4)+
  geom_smooth(aes(linetype=drv), se=FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(data=mpg, mapping=aes(x=displ,y=hwy))+
  geom_point(color="white", size=8)+
  geom_point(aes(color=drv),size=4)

Section 3.7.1:

Question 2:What does geom_col() do? How is it different to geom_bar()?

?geom_col

geom_col is one of two types of bar charts. It makes the heights of the bar represents values of the data.Meanwhile, geom_bar is another tye of bar chart that makes the height of the bar propositional to the number of cases in each group depending the weight of each value versus the weight of the sum of all values.

Answer these questions:

  1. Look at the data graphics at the following link: What is a Data Scientist. Please briefly critique the designer’s choices. What works? What doesn’t work? What would you have done differently?

The designer’s choices have some workable and some unworkable features. It works since he uses some bar charts to present some percenatges, and spectrum of blue color to differitate among diffrent values. It does not work since it lacks of ohter colors to better presentation and some studies are not presented in any plots or bar charts that it lacks convinceable presentation to the audience.I would use some bar charts or plots and other colors to make it more presentable.